A Generalized Single Linkage Method for Estimating the Cluster Tree of a Density
Authors
Abstract
The goal of clustering is to detect the presence of distinct groups in a data set and assign group labels to the observations. Nonparametric clustering is based on the premise that the observations may be regarded as a sample from some underlying density in feature space and that groups correspond to modes of this density. The goal then is to find the modes and assign each observation to the domain of attraction of a mode. The modal structure of a density is summarized by its cluster tree; modes of the density correspond to leaves of the cluster tree. Estimating the cluster tree is the primary goal of nonparametric cluster analysis. We adopt a plug-in approach to cluster tree estimation: estimate the cluster tree of the feature density by the cluster tree of a density estimate. For some density estimates the cluster tree can be computed exactly; for others we have to be content with an approximation. We present a graph-based method that can approximate the cluster tree of any density estimate. Density estimates tend to have spurious modes caused by sampling variability, leading to spurious branches in the graph cluster tree. We propose excess mass as a measure of the size of a branch, reflecting the height of the corresponding peak of the density above the surrounding valley floor as well as its spatial extent. Excess mass can be used as a guide for pruning the graph cluster tree. We point out mathematical and algorithmic connections to single linkage clustering and illustrate our approach on several examples. Supplemental materials for the article, including an R package implementing generalized single linkage clustering, all data sets used in the examples, and R code producing the figures and numerical results, are available online.
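The plug-in idea can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not spell out: the density estimate is a Gaussian kernel estimate, the weight of the edge between two observations is taken to be the minimum of the density estimate along the line segment joining them (approximated on a grid), and one level of the tree is read off as the connected components of the graph restricted to edges at or above that level. The function names `gaussian_kde`, `edge_weights`, and `level_set_components` are hypothetical and are not taken from the authors' R package.

```python
import numpy as np


def gaussian_kde(points, bandwidth):
    """Return a function evaluating a Gaussian kernel density estimate.

    `points` must have shape (n, d). Assumed estimator, for illustration only.
    """
    points = np.asarray(points, dtype=float)
    n, d = points.shape
    norm = n * (np.sqrt(2.0 * np.pi) * bandwidth) ** d

    def density(x):
        x = np.asarray(x, dtype=float)  # shape (m, d)
        sq = ((x[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq / (2.0 * bandwidth ** 2)).sum(axis=1) / norm

    return density


def edge_weights(points, density, n_grid=20):
    """Weight of edge (i, j): minimum density on a grid along the segment.

    The diagonal holds the density at each observation.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    t = np.linspace(0.0, 1.0, n_grid)[:, None]
    w = np.diag(density(points))
    for i in range(n):
        for j in range(i + 1, n):
            segment = (1.0 - t) * points[i] + t * points[j]
            w[i, j] = w[j, i] = density(segment).min()
    return w


def level_set_components(weights, level):
    """Label connected components using only edges with weight >= level.

    Observations whose own density is below `level` get label -1.
    """
    n = weights.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if weights[i, j] >= level:
                parent[find(i)] = find(j)

    labels = np.full(n, -1, dtype=int)
    ids = {}
    for i in range(n):
        if weights[i, i] >= level:
            labels[i] = ids.setdefault(find(i), len(ids))
    return labels
```

Sweeping `level` upward from zero and recording where components split traces out an approximate cluster tree. When the edge weight is replaced by a decreasing function of the distance between observations, the same thresholding yields ordinary single linkage clusters, which is one way to read the connection mentioned in the abstract.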
Similar resources
New Heuristic Algorithms for Solving Single-Vehicle and Multi-Vehicle Generalized Traveling Salesman Problems (GTSP)
Among numerous NP-hard problems, the Traveling Salesman Problem (TSP) has been one of the most explored, yet it remains unresolved. Even a minor modification changes the problem’s status, calling for a different solution. The Generalized Traveling Salesman Problem (GTSP) expands the TSP to a much more complicated form, replacing single nodes with a group or cluster of nodes, where the objective is to fi...
Determining a Suitable Sampling Method for Estimating the Density and Crown Canopy Area of Declined Persian Oak Trees (Quercus brantii Lindl.) in the Dinarkooh Protected Area, Ilam
Oak decline, one of the most important environmental problems of the Zagros forests, requires proper management to decrease tree dieback and mitigate its effects. This study aimed to find the best sampling method for estimating the density and crown canopy of declined oak trees in the Zagros forests. All declined trees in an area of 100 ha of the Dinarkooh protected forest were surveyed and tree density, g...
Choosing the Best Hierarchical Clustering Technique Based on Principal Components Analysis for Suspended Sediment Load Estimation
Introduction: The assessment of watershed sediment load is necessary for controlling soil erosion and reducing the potential for sediment production. Differing estimates of sediment amounts, along with the lack of long-term measurements, limit access to reliable data series of erosion rate and sediment yield. Therefore, the observed data of suspended sediment load could be used to ...
Who Should be Interviewed? A Response from Cluster Analysis
Objective: This article presents an application of cluster analysis for social science research, especially studies that include interviews as part of their data collection. This application is best suited to sequential mixed-method researchers who use quantitative data to frame subsequent qualitative subsamples for conducting interviews. Methods: In more detail, the algorithm (i....
Study on the effect of forest stand distribution pattern on results of different estimators of the nearest individual distance method
The Nearest Individual Sampling Method is one of the distance sampling methods for estimating density, canopy cover and height of forest stands. Some distance sampling methods have more than one density estimator, and these may be biased by the spatial pattern unless the stands of trees under study have a random spatial pattern. Therefore, the purpose of this study was to evaluate the effect of s...
Journal title:
Volume, issue:
Pages: -
Publication date: 2007